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ABSTRACT 

Accurate photometric redshifts are a lynchpin for many future experiments to pin down the 
cosmological model and for studies of galaxy evolution. In this study, a novel sparse regres¬ 
sion framework for photometric redshift estimation is presented. Synthetic dataset simulating 
the Euclid survey and real data from SDSS DR12 are used to train and test the proposed 
models. We show that approaches which include careful data preparation and model design 
offer a significant improvement in comparison with several competing machine learning al¬ 
gorithms. Standard implementations of most regression algorithms use the minimization of 
the sum of squared errors as the objective function. For redshift inference, this induces a bias 
in the posterior mean of the output distribution, which can be problematic. In this paper we 
directly minimize the target metric Az = (zs — z:p)/(l + Zs) and address the bias problem 
via a distribution-based weighting scheme, incorporated as part of the optimization objective. 
The results are compared with other machine learning algorithms in the field such as Artificial 
Neural Networks (ANN), Gaussian Processes (GPs) and sparse GPs. The proposed framework 
reaches a mean absolute Az = 0.0026(1 + Zs), over the redshift range of 0 ^ Zs ^ 2 on the 
simulated data, and Az = 0.0178(1 -f Zs) over the entire redshift range on the SDSS DR12 
survey, outperforming the standard ANNz used in the literature. We also investigate how the 
relative size of the training sample affects the photometric redshift accuracy. We find that a 
training sample of >30 per cent of total sample size, provides little additional constraint on 
the photometric redshifts, and note that our GP formalism strongly outperforms ANNz in the 
sparse data regime for the simulated data set. 
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1 INTRODUCTION 

The radial component of the position of a distant object is inferred 
from its cosmological redshift, induced by the expansion of the 
Universe; the light observed from a distant galaxy appears to us 
at longer wavelengths than in the rest frame of that galaxy. The 
most accurate determination of the exact redshift, z, comes from 
directly observing the spectrum of an extragalactic source and mea¬ 
suring a consistent multiplicative shift, relative to the rest frame, 
of various emission (or absorption) features. The rest-frame wave¬ 
lengths of these emission lines are known to a high degree of accu¬ 
racy which can be conferred onto the measured spectroscopic red¬ 
shifts, Zj. However, the demand on telescope time to obtain spectra 
for every source in deep, wide surveys is prohibitively high, and 
only relatively small area spectroscopic campaigns can reach faint 
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magnitudes (e.g. lUillv et alj|200^ : lUe Fevre et alj[2013l . 1201^ . or 
at the ot her extreme, relatively bright magnitudes over larger ar - 
eas (e.g. IColless et alj|2003l : [Driver et al1l201 ihlAlam et all 2015h . 
This forces us towards the use of photometric observations to in¬ 
fer the redshift by other means. Rather than individual spectra, the 
emission from a distant galaxy is observed in several broad filters, 
facilitating the characterization of the spectral energy distribution 
(SED) of fainter sources, at the expense of fine spectral resolution. 

Photometric redshift methods largely fall into two 
categories, based on either SED template fitting or ma- 
chine learning. Template fitting software suc h as Hyperz; 
teolzonella, Miral les & Pel i3 2000h. ZEBRA: jpeldmann et alj 
l2006h . EAZ Y; (Bramm e r, van Dokkum & Coppil 2008h and Le 
Phare Gilbert et alj bOOfili rely on a library of SED templates 
for a variety of different galaxy types, which (given the trans¬ 
mission curves for the photometric filters being used) can be 
redshifted to fit the photometry. This method can be refined 
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in various ways, often with the use of simulated SEDs rather 
than only those observed at low redshift, composite SEDs, and 
through calibration using any available spectroscopic redshifts. 
Machin e learning methods such as artificial neural networks (e.g . 


ANNz jEirth. Lahav & Somervilh 


ilollister & Laha^ 


rKs (e.g . 
■v|2004 . 


nearest-neighbour (NN) i Ball et al.ll2008h . genetic algorith ms (e.g. 
Hogan. Eairbaim & Seeburr 2015h. self-organized maps jGeachl 
2012h and random forest l Kind & Brunneii l2013h . to name but 


a few, rely on a significant fraction of sources in a photometric 
catalogue having spectroscopic redshifts. These ‘true’ redshifts 
are used to train the algorithm. In addition to providing a point 
estimate, machine learning methods can provide the degree of un¬ 
certainty in their pred iction jKind & Brunnei|[2013h|Bonnett et al] 
|2015l;lRauetZll2015h . 

Both methods have their strengths and weaknesses, with the 
best performance often depending on the available data and the 
intended science goals. As such, future surveys may well depend 
on contributions from both in tandem, but there has been extensive 
work on comparing the curr ent state of the art in publ i c software us¬ 
ing a variety of techniques (iHildebrand^et^ 20 id : lAbdalla et alj 
I 2 OI ll : I Sanchez et al.l l2014l : iBonnett et alj|2015l) . Artificial neural 
networks motivate the most commonly u s ed machine learning 
software 1 Firth. L^av & Somerville! l2003l : IVanzella et al. 200jJ 
Brescia et alJl2014li . however Gaussian Processes le.g. Iwav et alj 


2 OO 9 I) have no t yet become well es tablished in this area, despite 


comparison bv iBonfield et alJ l l2010h suggesting that they may out¬ 
perform the popular ANNz code, using the rms error as a metric. 

In this paper, we introduce a novel sparse kernel regression 
model that greatly reduces the number of basis (kernel) functions 
required to model the data considered in this paper. This is achieved 
by allowing each kernel to have its own hyper-parameters, gov¬ 
erning its shape. This is in contrast to the standard kernel-based 
models in which a set of global hyper-parameters are optimized 
(such as is typical in Gaussian Process (GP) methods). The com¬ 
plexity cost of such a kernel-based regression model is O {n^), 
where n is the number of basis functions. This cubic time com¬ 
plexity arise from the cost of inverting an n by n covariance ma¬ 
trix. I n a standard Gaussian Process model d^smussen & WilliamsI 
l200d) . seen as a kernel regression algorithm, we may regard the 
basis functions, as located at the n points in the training sample. 
This renders such an approach unusable for many large training 
data applications where scalability is a major concern. Much of the 
work done to make GPs more scalable is either to (a) make the 
inverse computation faster or (b) use a smaller representative set 
from the training sample to reduce the rank and ease the compu¬ 
tation of the covariance matrix. Examples of (a) include methods 
such as structuring the co variance matrix such that it is mu ch easier 
to invert, using ToeplitzjlZhangMxjthead & Ixj^ l2005h or Kro- 
necker decomposition jTsiligkaridis & H ercill2013 l), or inverse ap - 
proximation as an optimization problem ( Gibbs & MacKavll 19971) . 
To reduce the number of representative points (b), an m <C n sub¬ 
set of the training sample can be selected which m aximizes the 
accur acy or the numerical stability of the inversion dFoster et alj 
l2009li . Alternatively, one may search for “inducing” points not nec¬ 
essarily present in the training sample, and not necessarily even 
lying within the data range, to use as the basis set such that the 
probabi lity of the data being generat ed from the model is maxi¬ 
mized dSnelson & Ghahramanill2006h. Appro aches such as Rele¬ 
vance Vector Ma chines (RVM; Tipping 200 ih and Support Vector 
Machines (SVM; ISmola & VapniM 19971) are basis-function mod¬ 
els. However, unlike sparse GPs, they do not learn the basis func¬ 
tions’ locations but rather apply shrinkage to a set of kernels in the 


form of weight-decay on the linear weights that couple the kernels, 
located at training data points, to the regression. 

In this paper, we propose a non-stationary sparse Gaussian 
model to target photometric redshift estimation. The key difference 
between the proposed approach and other basis function models, 
is that our model does not use shrinkage (automatic relevance de¬ 
termination) external to the kernel, but instead has a length-scale 
parameter in each kernel. This allows for parts of the input-output 
regression mapping to have different characteristic length-scales. 
We can see this as allowing for shrinkage and reducing the need 
for more basis functions, as well as allowing for non-stationary 
mappings. A regular GP, sparse GP or RVM does not do this, and 
we demonstrate that this is advantageous to photometric redshift 
estimation. Furthermore, the model is presented within a frame¬ 
work with components that address other challenges in photomet¬ 
ric redshift estimation such as incorporating a weighting scheme 
as an integral part of the process to remove, or introduce, any sys¬ 
tematic bias, and a prior mean function to enhance the extrapola¬ 
tion performance of the model. The results are demonstrated on 
phot ometric redshift est imation for a simulated Euclid-\ike, sur¬ 
vey jLaureiis et alj|20flh and on observational data fr om the 12th 
Data Release of the Sloan Digital Sky Survey (SDSS) jAlam et alj 
ImIB . In particular, we use the weighting scheme to remove any 
distribution bias and introduce a linear bias to directly target the 
mission’s requirement. 

The paper is organised as follows, a brief introduction to Gaus¬ 
sian Processes for regression is presented in Sectionl^followed by 
an introduction to sparse GPs in Section]^ The proposed approach 
is described in Section|4] followed by an application to photomet¬ 
ric redshift estimation in Section |5] where the details of the mock 
dataset are described. The experiments and results are discussed in 
Sectionl^on the simulated survey, and in Section|7]we demonstrate 
the performance of the proposed model and compare it to ANNz on 
the SDSS 12th Data Release. Finally, we summarize and conclude 
in Section [S] 


2 GAUSSIAN PROCESSES 

In many modelling problems, we have little prior knowledge of 
the explicit functional form of the function that maps our observ¬ 
ables onto the variable of interest. Imposing, albeit sensible, para¬ 
metric models, such as polynomials, makes a tacit bias. For this 
reason, much of modem function modelling is performed using 
non-parametric techniques. For regressio n, the most widely used 
appro ach is that of Gaussian Processes l lRasmussen & WilliamsI 
l2006h . A Gaussian Process is a supervised non-linear regression 
algorithm that makes few explicit parametric assumptions about 
the nature of the function fit. For this reason, Gaussian Processes 
are seen as lying within the class of Bayesian non-parametric mod¬ 
els. The underlying assumption in a GP is that, given a set of input 
X = and a set of target outputs y = G 

K", where n is the number of samples in the dataset and d is the 
dimensionality of the input, the observed target yt is generated by 
a function f of the input Xi plus additive noise a: 

yi = { (xi) -I- a. (1) 

The noise e is taken to be normally distributed with a mean of zero 
and variance cr^, or e AC (O, cr^). To simplify the notation, it is 
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assumed that y ~ A/” (0,1) (this can readily be achieved without 
loss of generality, via a linear whitening process) and univariate, al¬ 
though the derivation can be readily extended to multivariable prob¬ 
lems. The conditional probability of the observed variable given the 
function is hence distributed as follows: 


p(y|f) ~ A/'(f,cr^) . 


( 2 ) 


A GP then proceeds by applying a Bayesian treatment to the prob¬ 
lem to infer a probability distribution over the space of possible 
functions f given the data: 


p(f|y,x) 


p(y|f)p(f|x) 

p(y|x) 


(3) 


This requires us to define a prior, p (f |X), over the function 
space. The function is normally distributed with a mean of zero, 
to match the mean of the normalized variable y, with a covariance 
function K, i.e. p (f |X) ~ A/" (0, K). The covariance function cap¬ 
tures prior knowledge about the relationships between the observ¬ 
ables. Most widely used covariance functions assume that there is 
local similarity in the data, such that nearby inputs are mapped to 
similar outputs. The covariance K can therefore be modelled as 
a function of the input X, K = fc (X, X), where each element 
Kij = k {:Ki, Xj) and k is the covariance function. For K to be a 
valid covariance it has to be symmetric and positive semi-definite 
matrix; arbitrary functions for k cannot guarantee these constraints. 
A class of functions that guara ntees these stru ctural constraints are 
referred to as Mercer kernels Im erce A commonly used 

kernel function, which is the focus of this work, is the squared ex¬ 
ponential kernel, defined as follows: 


k {xi,Xj) = exp 




(4) 


where cr and A are referred to as the height (output, or variance) and 
characteristic length (input) scale respectively, which correspond to 
tunable hyper-parameters of the model. The similarity between two 
input vectors, under the squared exponential kernel, is a non-linear 
function of the Euclidean distance between them. We note that this 
choice of kernel function guarantees continuity and smoothness in 
the function and all its derivatives. For a m ore extensive discussion 
of co vari ances, the reader i s referred to jRasmussen & Williams! 
l2006h or ([Roberts et alj2013h .With the likelihood p (y | f) and prior 
p (f |X), the marginal likelihood p (y|X) can be computed as fol¬ 
lows: 


P(y|X) = J p(y|f)p(f|X)df, 


(5) 


By multiplying the likelihood and the prior and completing the 
square over f, we can express the integration as a normal dis¬ 
tribution independent of f multiplied by a another normal dis¬ 
tribution over f. The distribution independent of f can then be 
taken out of the integral, and the integration of the second nor¬ 
mal distribution with respect to f will be equal to one. The result- 
ing distribution of the margina l likelihood is distributed as follows 
( [Rasmussen & WilliamsI200^1 : 

p(y|X)-Af(0,K + a2l). (6) 

The marginal likelihood of the full data set can hence be computed 
as follows: 


p(y|x) = n 

i=l 


1 

— , exp 

^27r |K + all] 


(K-Fcr^l) 

(V) 


The aim of a GP, is to maximize the probability of observing 
the target y given the input X, Eq. 0- Note that the only free pa¬ 
rameters to optimize in the marginal likelihood are the parameters 
of the kernel and the noise variance, collectively referred to as the 
hyper-parameters of the model. It is more convenient however to 
maximize the log of the marginal likelihood, Eq. 0, since the log 
function is a monotonically increasing function, maximizing the 
log of a function is equivalent to maximizing the original function. 
The log likelihood is given as: 

log p(y|X) =-iy"^ (K + Icr^) V 

- ^ log |K + lo-^l - ^ log(27r). (8) 

We search for the optimal set of hyper-parameters using a 
gradient-based optimization, hence we require the derivatives of 
the log marginal likelihood with respect to each hyper-parameter. 
In this paper, the L-BFGS algorithm was used to optimize the ob¬ 
jective which uses a Quasi-Newton method to compute the search 
direction in each step by approximating the inverse of th e Hessian 
matrix from the hist ory of gradients in previous steps ([Nocedaj 
[19801 : [Schmidl[[20051) . It is worth mentioning that non-parametric 
models require the optimization of few hyper-parameters that do 
not grow with the size of the data and are less prone to overfit¬ 
ting. The distinction between parameters and hyper-parameters of 
a model is that the former directly influence the input-output map¬ 
ping, for example the linear coupling weights in a basis function 
model, whereas the latter affect properties of distributions in the 
probabilistic model, for example the widths of kernels. Although 
this distinction is somewhat semantic, we keep to this nomencla¬ 
ture as it is standard in the statistical machine learning literature. 

Once the hyper-parameters have been inferred, the conditional 
distribution of future predictions f* for test cases X* given the 
training sample can be inferred from the joint distribution of f* 
and the observed targets y. If we assume that the joint distribution 
is a multivariate Gaussian, then the joint probability is distributed 
as follows: 


p(y,f*|X,X*)-Af 0 


Kxx + rr^I 



(9) 


where we introduce the shorthand notations Kxx = k (X, X), 
Kx* = fc(X,X*), K*x = fc(X*,X) andK** = fc(X*,X*). 
The conditional distribution p(f*|y,X,X*) is therefore dis¬ 
tributed as follows: 

p(f*|y.X,X*) ~ Ar(/x,E), 

Pi = K*x (Kxx + ^y, 

S = K** - K*x (Kxx + Kx*. 

( 10 ) 

If we assume a non-zero prior mean /if over the function, p{f) ~ 
A/" (/If, K), and an un-normalized y with mean /ly, the mean of 
the posterior distribution will be equal to: 

/I =/If + K*x (Kxx + (T^I) ^(y-/iy), (11) 

The main drawback of GPs is the O (rd) computational cost 
required to invert the nxn matrix Kxx+f^I. The sparse Gaussian 
Process allows us to reduce this computational cost and is detailed 
in the following section. 






















4 Almosallam et al. 


3 SPARSE GAUSSIAN PROCESSES 

Gaussian processes are often described as non-parametric regres¬ 
sion models due to the lack of an explicit parametric form. In¬ 
deed GP regression can also be viewed as a functional mapping 
X t-K parameterized by the data and the ker¬ 

nel function, followed by linear regression via optimization of the 
following objective: 

nnn i (Kw - y)^ (Kw - y)-f icr^w^w, (12) 

where w are the set of coefficients for the linear regression model 
that maps the transformed features K to the desired output y. 
The feature transformation K evaluate how “similar” a datum is 
to every point in the training sample, where the similarity mea¬ 
sure is defined by the kernel function. If two points have a high 
kernel response via Eq. this will result in very correlated fea¬ 
tures, adding extra computational cost for very little or no added 
information. Selecting a subset of the training sample that maxi- 
m izes the pres e rved i nformation i s a research question ad d ressed 
in [Foster et"^ ( l2009h . whereas in ISnelson & Ghahramanil l l2006ll 
the basis functions are treated as a search problem rather than a se¬ 
lection problem and their locations are treated as hyper-parameters 
which are optimized. These approaches result in a transformation 
X G :->■ K G R"’'™', in which m < n is the number of 

basis functions used. The transformation matrix K will therefore 
be a rectangular n by m matrix and the solution for w in Eq. (H 
is calculated via standard linear algebra as: 

(K^K + laiy^K^y. (13) 

Even though these models improve upon the computational 
cost of a standard GP, very little is done to compensate for the 
reduction in modelling power caused by the “loss” of basis func¬ 
tions. The selection method is always bounded by the full GP’s 
accuracy, on the training sample, since the basis set is a subset of 
the full GP basis set. On the other hand, the sparse GP’s ability to 
place the basis set freely across the input space does go some way 
to compensate for this reduction, as the kernels can be optimized 
to describe the distribution of the data. In other words, instead of 
training a GP model with all data points as basis functions, or re¬ 
stricting it to a subset of the training sample which require some 
cost to select them, a set of inducing points is used in which their 
locations are treated as hyper-parameters of the model to be opti¬ 
mized. In both a full and a low rank approximation GP, a global 
set of hyper-parameters is used for all basis functions, therefore 
limiting the algorithm’s local modelling capability. Moreover, the 
objective in Eq. d minimizes the sum of squared errors, therefore 
for any non-uniformly distributed output, the optimization routine 
will bias the model towards the mean of the output distribution and 
will seek to fit preferentially the region of space where there are 
more data. Hence, the model might allow for very poor predictions 
for few points in poorly represented regions, e.g. the high redshift 
range, in order to produce good predictions for well represented re¬ 
gions. Therefore, the error distribution as a function of redshift is 
not uniform unless the training sample is well balanced, producing 
a model that is sensitive to how the target output is distributed. 

In the next section, a method is proposed which addresses the 
above issues by parametrizing each basis function with bespoke 
hyper-parameters which account for variable density and/or pat¬ 
terns across the input space. This is particularly pertinent to deter¬ 
mining photometric redshifts, where complete spectroscopic infor¬ 
mation may be restricted or biased to certain redshifts or galaxy 
types, depending on the target selection for spectroscopy of the 


training sample. This allows the algorithm to learn more complex 
models with fewer basis functions. In addition, a weighting mech¬ 
anism to remove any distribution bias from the model is directly 
incorporated into the objective. 


4 PROPOSED APPROACH 

In this paper, we extend the sparse GP approach by modelling each 
basis (kernel) function with its own set of hyper-parameters. The 
kernel function in Eq. is hence redefined as follows: 



where P = G are the set of basis coordinates 

and \j is the corresponding length scale for basis j. The mul¬ 
tivariate input is denoted as X = G R"^‘*. Through¬ 

out the rest of the paper, Xi_* denotes the i-th row of matrix 
X, or Xi for short, whereas X*j denotes the j-th column and 
Xi,j refers to the element at row i and column j in matrix X, 
and similarly for other matrices. Note that the hyper-parameter a 
has been dropped, as it interferes with the regularization objec¬ 
tive. This can be seen from the final prediction equation {ji — 
Wjcr|exp (—||xi — pjll^/2Aj), the weights are always 
multiplied by their associated a. Therefore, the optimization pro¬ 
cess will always compensate for decreasing Wj by increasing cr|. 
Dropping the height variance ensures that the kernel functions do 
not grow beyond control and delegates learning the linear coeffi¬ 
cients and regularization to the weights in w. The derivatives with 


respect to each length scale and position are provided in equations 
Eq. lll5db and Eq. lll5et respectively: 

E = ^(Kw — y) o K, 

(15a) 

= X InPj, 

(15b) 

Dij = ||xi - PjII^ , 

(15c) 

9/(X,y,w) TV v -3 

d\, - ’ 

(15d) 

a/(X,y,w)_j^, , 

dpj ^ 

(15e) 

The symbol o denotes the Hadamard product, i.e. element-wise ma¬ 
trix multiplication and denotes a column vector of length n with 
all elements set to 1. Finding the set of hyper-parameters that opti- 


mizes the solution, is in effect finding the set of radial basis func¬ 
tions defined by their positions p and radii A that jointly describe 
the patterns across the input space. By parametrizing them differ¬ 
ently, the model is more capable to accommodate different regions 
of the space more specifically. A global variance model assumes 
that the relationship between the input and output is global or equal 
across the input space, whereas a variable variance model, or non¬ 
stationary GP, makes no assumptions and learns the variable vari¬ 
ances for each basis function which reduces the need for more basis 
functions to model the data. The kernel in Eq. GD can be further 
extended to, not only model each basis function with its own radius 
\j, but also model each one with its own covariance Cj G 
This enables the basis function to have any arbitrary shaped ellipses 
giving it more flexibility. The kernel in Eq. | |14| | can be extended as 
follows: 

fc(xi,pj) = exp f-i (xi - Pj)CJ^ (xi -pj)^y (16) 
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Furthermore, to make the optimization process faster and sim¬ 
pler, we define the additional variables: 

cf = A,AJ, (17a) 

V, = A,A,-, (17h) 

where A.j € is a local affine transformation matrix for basis 

function j and Vj is the application of the local transformation to 
the data. Optimizing with respect to Aj directly ensures that the 
covariance matrix is positive definite. This makes it faster from a 
computational perspective as the kernel functions for all the points 
with respect to a particular basis can be computed more efficiently 
as follows: 

fc(X,Pj) = exp (VjoVj) Id^. ( 18 ) 

The exponent in Eq. l ll8t basically computes the sum of squares in 
each row of Vj. This allows for a more efficient computation of the 
kernel functions for all the points in a single matrix operation. The 
derivatives with respect to each Aj and pj are shown in Eq. l ll9at 
and Eq. J19bt respectively. 


a/(X,y,w) 

dAj 

a/(X,y,w) 

dpj 


(aJo 


V,- 


— E V A 


(19a) 

(19b) 


Setting up the problem in this manner allows the setting of 
matrix Aj to be of any size d by q, where q < d which can be 
considered as a low rank approximation to C~^ without affecting 
the gradient calculations. In addition, the inverse of the covariance 
can be set to C~^ — AjAJ + diag{\j)~^ in the low rank ap¬ 
proximation case to ensure that the final covariance can model a 
diagonal covariance. This is re ferred to as factor analysis distance 
( IRasmussen & WilliamsI20061 p. 107) but previously used to model 
a global covariance as opposed to variable covariances as is the case 
here. 


4.1 Prior Mean Functions 

In the absence of observations, all Bayesian models, Gaussian pro¬ 
cesses included, rely on their priors to provide function estimation. 
For the case of Gaussian processes this requires us to consider the 
prior over the function, especially the prior mean. For example, the 
first term in the mean prediction in Eq. is our prior mean in 

which we learn the deviation from using a GP. Similarly, we may 
consider a mean function that is itself a simple linear regression 
from the independent to dependent variable. The parameters of this 
function are then inferred and the GP infers non-linear deviations. 
In the absence of data, e.g. in extrapolative regions, the GP wi ll 
fall back to the linear regression prediction ([Roberts et alj|2013h . 
We can incorporate this directly into the optimization objective in¬ 
stead of having it as a separate preprocessing step by redefining K 
as a concatenation of the linear and non-linear features, or setting 
K = [K|X|1„] and w = [w|wi,|6], where w_l is the linear re¬ 
gression’s coefficients and b is the bias. The prediction can then be 
formulated as follows: 


Kw = Kw + Xwl -f h. (20) 

Furthermore, the regularization matrix, Icr^, in Eq. (ED can 
be modified so that it penalises the learning of high coefficients 
for the non-linear terms, w, but small or no cost for learning high 


linear terms, wr, and h, by setting the corresponding elements in 
the diagonal of I to 0 instead of cr^, or the last d -|- 1 elements. 
Therefore, as cr^ goes to infinity, the model will approach a simple 
linear regression model instead of fallen back to zero. 


4.2 Cost-Sensitive Learning 


Thus far in the discussion, we make the tacit assumption that 
the objective of the inference process is to minimize the sum of 
squared errors between the model and target function values. Al¬ 
though this is a suitable objective for many applications, it is in¬ 
trinsically biased by uneven distributions of training data, sacri¬ 
ficing accuracy in less represented regions of the space. Ideally 
we would like to train a model with a balanced data distribution 
to avoid such bias. This however, is a luxury that we often do 
not have. Eor example, the lack of strong emission lines that are 
detectable with visible-wavelength spectrographs in the “redshift- 
desert” at 1.2 < z < 1.8 means that this redshift range is often 
under-represented in spectroscopic samples. A common technique 
is to either over-sample or under-s ample the data to achieve balance 
dWeiss. McCarthy & Zabaill2007l) . In under-sampling, samples are 
removed from highly represented regions to achieve balance, over- 
sampling on the other hand duplicates under represented samples. 
Both approaches come with a cost; in the former good data are 
wasted and in the latter more computation is introduced due to the 
data size increase. In this paper, we perform cost-sensitive learning, 
which increases the intrinsic error function in under-represented re¬ 
gions. In regression tasks, such as we consider here, the output can 
be either discretized and treated as classes for the purpose of cost 
assignment, or a specific bias is used such as 1 / (1 -|- as). To mimic 
a balanced data set in our setup, the galaxies were grouped by their 
spectroscopic redshift using non-overlapping bins of width 0.1. The 
weights are then assigned as follows for balanced training: 


max({/i,...,/B}) 
{fb-.ie Sb} 


( 21 ) 


where Wi is the error cost for sample i, fb is the frequency of sam¬ 
ples in bin number number i, B is the number of bins and Sb is 
the set of samples in set number b. Eq. IID assigns a weight to 
each training point which is the maximum bin frequency over the 
frequency of the bin in which the source belongs. This ensures that 
the error cost of source i is inversely proportional to its spectro¬ 
scopic redshift frequency in the training sample. The normalized 
weights are assigned as follows: 


Wi = 


1 

iT^ 


2 


( 22 ) 


After the weights have been assigned, they can be incorpo¬ 
rated directly into the objective as follows: 


rmn i (Kw - y)^ W (Kw - y)-f icr^w^w. (23) 

The difference between the objectives in Eq. | |12| | and Eq. l l23l l 
is the introduction of the diagonal matrix W, where each element 
Wii is the corresponding cost Wi for sample i. The first term in 
Eq. m is a matrix operation form for a weighted sum of squares 
(Ki,*w — y if , where the solution can be found analyt¬ 
ically as follows: 

w = (^K'^WK + Icr^) K^Wy. (24) 


The only modification to the gradient calculation is to set the matrix 
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E = W ((Kw — y) w^) o K. In standard sum of squared errors, 
W = I or the identity matrix. It is worth emphasising that this 
component of the framework does not attempt to weight the train¬ 
ing sample in order to match the distribution of the test sample, or 
matching the spectros copic distribution t o the photometric d istri- 
bution as proposed in lLima et al.l ([2008[)JCunha et_al ( l2009h and 
applied to photometric redshift in I Sanchez et al (2014), but rather 
gives the user of the framework the ability to control the cost per 
sample to serve different science goals depending on the applica¬ 
tion. In this paper, the weighting scheme was used for two different 
purposes, the first was to virtually balance the data to mimic train¬ 
ing on a uniform distribution, and the second was to directly target 
the weighted error of 1/ (1 -|- Zg). 


5 APPLICATION TO PHOTOMETRIC REDSHIFT 
ESTIMATION 



(a) Full 


In this section, we specifically target the photometric bands and 
depths planned for Euclid. Euclid aims to provide imaging data in 
a broad RIZ band and the more standard near-infrared Y, J and H 
bands, while groun d-based ancillary dat a are expected in the optical 
g, r, i and z bands dLaureiis et al.ll201lh . 


5.1 Simulated Dataset 

We use a mock dataset from ljouvel et alj d2009l) . consisting of the 
g, r, i, z, RIZ, Y, J and H magnitudes (to lOcr depths of 24.6, 
24.2, 24.4, 23.8, 25.0 for the former, and 5a depth of 24.0 for 
each of the latter three near-infrared filters) for 185,253 simulated 
sources. We remove all sources with RIZ > 25 to simulate the tar¬ 
get magnitudes set for Euclid. In addition, we remove any sources 
with missing measurements in any of their bands prior to training 
(only 15 sources). No additional limits on any of the bands were 
used, however in Section [6^ we do explicitly impose limits on the 
RIZ band to test the extrapolation performance of the models. The 
distribution of the spectroscopic redshift is provided in Figure [T] 
For all experiments on the simulated data, we ignore the uncertain¬ 
ties on the photometry in each band and train only on the magni¬ 
tudes since in the simulated data set, unlike real datasets, the log 
of the associated errors are linearly correlated with the magnitudes, 
especially in the targeted range, therefore adding no information. 
However, they were fed as input to ANNz to satisfy the input for¬ 
mat of the code. 

We preprocess t he data using Principle Component Analysis 
(PCA; Ijolliffd 1 19861) to de-correlate the features prior to learn¬ 
ing, but retain all features with no dimensionality reduction. De- 
correlation accelerates the convergence rate of the optimization 
routine especially wh en using a logistic- type kernel machines such 
as Neural Networks (iLeCun et alJll998h . To understand this, con¬ 
sider a simple linear regression example where we would like 
to solve for w in Aw = b, the solution for this is w = 
(A^A) A^b. Note that if A is de-correlated A^ A = I, there¬ 
fore learning Wi depends only on the i-th column of A and it is 
independent from learning Wj, where i 7^ y. In an optimization ap¬ 
proach, the convergence rate is a function of the condition number 
of the A^ A matrix, which is minimized in the case of de-correlated 
data. This represents a quadratic error surface which helps accel¬ 
erate the search. This is particularly important in the application 
addressed in this paper because the magnitudes measured in each 
filter are strongly correlated with each other. 



2s 


(b) RIZ <23 



(c) RIZ <22 

Figure 1. The spectroscopic redshift distribution of the (a) full dataset, (b) 
sources with RIZ magnitude<23 and (c) sources with RIZ magnitude<22 
from the simulated data 
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Table 1. The time complexity of each approach. 

Method Time Complexity 


ANNz (1-layers) 

STABLEGP 

GP-GL 

GP-VL 

GP-VC 


O (^nmd + (Z — 

O (nm?) 

O [nmd + n-m?) 
O (nmd + nm^^ 
O (nmd? + nm?) 


6 RESULTS ON THE SIMULATED DATA 


Five algorithms are considered to model the data; Artificial Neural 
Networks (ANNzi fCollister & Lahav 2()04l). a GP with low rank 
approximation (STABLEGP; Foster et ^ lioogh . a sparse GP with 
global length scale (GP-GL), a GP with variable length scale (GP- 
VL) and a GP with variable covariances (GP-VC). For ANNz, a 
single layer network is used, and to satisfy the input format for the 
code, the data were not de-correlated and the uncertainties on pho¬ 


tometry for each band were used as part of the traini ng input. For 
STABL EGP, we use the SR-VP method proposed in [Foster et alj 
ll2009h . In subsequent tests, the variable m refers to the number of 
hidden units in ANNz, the rank in STABLEGP, and the number of 
basis functions in GP-GL, GP-VL and GP-VC. The time complex¬ 
ities for each algorithm are shown in Table [T] The data were split 
at random into 80 per cent for training, 10 per cent for validation 
and 10 per cent for testing. We note that we investigate the accu¬ 
racy for various training sample sizes in Section [6A1 All models 
were trained using the entire redshift range available, but we only 
report the performance on the reds hift range of 0 ^ gs ^ 2 to tar¬ 
get the parameter space set out in lLaureiis et alj (1201 ih . We train 
each model for 500 iterations in each run and the validation sample 
was used for model selection and parameter tuning, but all the re¬ 
sults here are reported on the test sample, which is not used in any 
way during the training process. Tablel^shows the metrics used to 
report the performance of each algorithm. 


6.1 Modelling Performance 

In the first test, all models were trained using a fixed m = 10 
to cross-compare the performance of the methods using the same 
number of basis functions. The number of basis functions was set 
deliberately low to highlight the sparse-limit modelling capabili¬ 
ties of each algorithm, as for large values of m the performance 
gap between the algorithms reduces making it harder to compare 
the strengths and weaknesses of each algorithm. The standard sum 
of squares objective was used, without cost-sensitive learning or a 
prior mean function to keep the comparison as simple as possible. 
The Zs versus Zp density scatter plots are shown in Figure and 
their performance scores of each algorithm are reported in Tablej^ 
Figurel^clearly shows the advantage of using inducing points over 
an active set when we compare GP-GL’s performance in Figure 
[^with STABLEGP’s performance in Figure |2H The problem for¬ 
mulation of the two are identical, except that GP-GL’s basis set is 
learned as part of the optimization objective whereas STABLEGP’s 
basis set is pre-selected in an unsupervised fashion. 


6.2 Prior Mean 

We also test the extrapolation performance of the GP-VC model us¬ 
ing different prior means, namely a zero mean, a linear regression 


prior and a joint optimization approach that learns the linear and 
non-linear features simultaneously by regularizing the non-linear 
features more aggressively than linear features and compares them 
with ANNz. The difference between the linear regression prior and 
the joint optimization approach, is that the former first fits a linear 
model to the data then subtracts the predictions from the ground 
truth before training a GP model, whereas the latter learns both 
the linear model and the non-linear deviations from it jointly. To 
test this more effectively, the models were trained using sources, 
with RIZ < 23 (29,024 objects from the training sample) and 
tested on the unseen samples with RIZ < 23, RIZ is 23, and 
the entire test sample. A similar test was also conducted using a 
split of RIZ < 22 (12,056 objects from the training sample). This 
also demonstrates the effectiveness of the algorithms in the sce¬ 
nario where the brightest sources dominate the training sample, as 
may be true in practice. The results are reported in Table|4]and the 
density scatter plots are shown for comparison in Figure |3 The re¬ 
sults show that the “Joint” method consistently outperformed the 
other methods in extrapolation as well as in interpolation, espe¬ 
cially when trained with a small sample size as in the RIZ < 22 
case. Moreover, upon examining the density scatter plots in Fig- 
ure|3 it has fewer systematic and catastrophic errors than the other 
methods with a factor of ~ 2 improvement over ANNz where the 
training data are limited in magnitude/flux-density. 


6.3 Cost-Sensitive Learning 

We now perform a comparison between cost-sensitive learning and 
the normal sum of squared errors for the GP-VC model. Two dif¬ 
ferent weight configurations are tested, the first is to assign an error 
cost to each sample as in Eq. 1221 (Normalized), and the second 
experiment is to weight each sample according to the frequency 
of their true redshift to ensure balanced learning (Balanced) as in 
Eq. (ED, in addition to the (Normal) sum of squared errors. The 
algorithms were trained such that they have equal ISz score, to ex¬ 
amine the differences between the other metrics and the resulting 
error distributions. The box plots for the “Normal”, “Balanced” and 
“Normalized” are shown in EigurejS] The figures show the perfor¬ 
mance on the held out test sample for the entire range to demon¬ 
strate the effect more clearly. Cost-sensitive learning is more con¬ 
sistent across the redshift range as opposed to the normal sum of 
squares, especially in the high redshift regions where there are less 
data. The confidence intervals are also considerably smaller for the 
“Balanced” case. The “Normalized” training on the other hand re¬ 
sults in a systematic bias, as expected, towards the lower part of 
the redshift range. The performance comparison for the “Normal”, 
“Balanced” and “Normalized” training are summarized in Tablej^ 
but the metrics are reported on the desired range of 0 < < 2. 

Balanced training shows a better generalization performance, as it 
outperforms the normal sum of squares objective on the test sam¬ 
ple and has lower maximum errors, although the differences are 
generally small. 


6.4 Size of the Training Sample 

Thus far, we have only considered the case of having a large (80 
per cent of the total data) training sample. In practice, it is likely 
that the training-validation-test will be substantially smaller than 
the dataset for which photometric redshifts are required. Therefore, 
in this section the generalization performance of the models are 
tested by limiting the training sample size to different percentages 
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Table 2. Performance metrics used to evaluate the models. The number of samples is denoted by n. 


Metric 

Equation 

Description 

S(i) 

Ai) Ai) 

^p 

Error for the 2 -th object 

^(») 

'^norm 

<5W/(i + 2W) 

Normalized error for the 2 -th object 

Az 



Root mean squared error 

^Znorm 

>/ 

^ (siiorm) 

Normalized root mean squared error 

maxz 

max({ 1 , ■ ■ • I |}) 

Maximum eiTor 

max Tiorm 

/L(l) 1 |\\ 

iiiaA i \ '-’norm ’ • • • ’ '^norm i f 

Maximum normalized error 

fiz 

1V"- 5(«) 

Bias 

f-^norm 

1 v-'n r(^) 

2^i=l ^norm 

Normalized bias 

<yz 



Standard deviation of the errors 

O' norm 

/ 

“ ~ fJ'norm^ 

Standard deviation of the normalized errors 

OUtz 

1 

{i : > 2(Tz} 1 

Fraction of errors above two standai'd deviations from the mean 

OUtnorm 

n 

■^2 : S^orm ^ 2(T7[,orm^| 

Fraction of normalized errors above two standard deviations from the mean 


Table 3. Performance measures for each algorithm trained on the simulated survey using m = 10 basis functions. The best-performing algorithm is highlighted 
in bold font 



Az 

^-^norm 

max 2 

Ti\3Xnorm 

fJ'Z 

f-^norm 

Oz 

O norm 

OUtz 

OUtTiorm 

ANNz 

0.0848 

0.0568 

1.0126 

0.4199 

-0.0025 

-0.0048 

0.0847 

0.0566 

0.0505 

0.0532 

stableGP 

0.4399 

0.2836 

1.5906 

1.4441 

-0.0085 

-0.0365 

0.4399 

0.2812 

0.0509 

0.0539 

GP-GL 

0.1420 

0.0952 

1.1183 

0.5953 

-0.0064 

-0.0106 

0.1418 

0.0946 

0.0548 

0.0530 

GP-VL 

0.1251 

0.0833 

1.0349 

0.7953 

-0.0074 

-0.0094 

0.1249 

0.0828 

0.0549 

0.0552 

GP-VC 

0.0435 

0.0294 

0.5488 

0.5380 

-0.0005 

-0.0007 

0.0435 

0.0294 

0.0487 

0.0473 




2.5r 
2 
1.5 
t?- 1 







(a) ANNZ 


(b) stableGP 


0.5 1 

(c) GP-GL 


1.5 




Figure 2. Density scatter plots of the true 2s vs the predicted Zp for (a) ANNz, (b) STABLEGP, (c) GP-GL, (d) GP-VL and (e) GP-VC using m = 10 basis 
functions trained on the simulated dataset. The plots shows the performance on the same test sample, the colours however are scaled differently according to 
the density range in each plot to avoid colour saturation 
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Figure 3. Density scatter plots of the true 2s versus the predicted 2 p after training the GP-VC model on samples with RIZ < 23 (top) and RIZ < 22 
(bottom) from the simulated data using m = 10 basis functions with (a) a zero mean prior, (b) linear regression prior and (c) a joint prior optimization and 
(d) ANNz. The plots shows the performance on the same test sample, the colours however are scaled differently according to the density range in each plot to 
avoid colour saturation 


Table 4. The value of Az for the GP-VC model when trained on the simulated survey using m = 10 basis functions with different prior mean functions and 
RIZ splits. The results for ANNz are shown for comparison 


Trained on RIZ <21 RIZ < 23 Full 

Tested on <22 ^ 22 Full <23 ^23 Full Full 


ANNz 

0.0385 

0.1383 

0.1325 

0.0537 

0.1458 

0.1315 

0.0848 

Zero 

0.0233 

0.2539 

0.2424 

0.0362 

0.1261 

0.1129 

0.0435 

Linear 

0.0199 

0.1043 

0.0997 

0.0321 

0.1097 

0.0983 

0.0412 

Joint 

0.0192 

0.0982 

0.0939 

0.0277 

0.0653 

0.0593 

0.0298 


of the dataset. The validation and test samples were fixed to the 
same samples used in previous experiments to ensure consistent 
reporting on the same test sample. The models were trained using 
various percentages from 5% to 80%, once using a small number of 
basis functions (m = 10) and a second experiment using a larger 
number of basis functions (m — 100) and the results are shown for 
both in Figures|^and|^respectively. GP-VC consistently outper¬ 
forms the other models across the training range using both sim¬ 
ple and complex models. ANNz on the other hand performs poorly 
and quickly overfits in the complex version of the test. It is worth 
noting that, unlike the other models, ANNz consistently shows an 
unsteady performance in all of the parameter tuning tests for the 
simulated data set. However, we note that on real noisy data this 
problem diminishes (Sec.|7j. 

6.5 Size of the basis set 

Until now, we have limited the number of basis functions to 10, 
except for the last section to test the generalization performance 
of the models. In practice, the only limitation on the number of 
basis functions used for training the GP is computing resources. 
In this section we investigate how the accuracy of the photometric 
redshifts depends on the number of basis functions. 

We cross-compare all of the models by varying the number of 
basis functions m from 5 to 200 by an increment of 5 to study the 
relationship between accuracy and complexity. Az as a function of 


m for all the models are shown in Figure|^ the j/-axis is shown on a 
log scale for the purpose of visualisation. The STABLEGP method 
exhibits the worst performance across the board, especially when 
the number of basis functions is small. On the other hand, GP- 
VC consistently outperforms the rest, and most significantly when 
trained with few basis functions. ANNz outperforms GP-GL and 
GP-VL, but it does not scale well with complexity as it starts to 
overfit after m = 30. All the models were trained using a sum of 
squared errors objective with no cost-sensitive learning or a prior 
mean function optimization in this experiment. 

In Figure|7]we show the values of Az and Aznorm for the GP- 
VC approach using an extended range of basis functions of 5, 10, 
25, 50, 100, 200, 400, 800 and 1600 with joint linear optimization, 
using both the normalized weights and normal sum of squares. With 
the GP-VC we obtain Aznorm = 0.0295 with just m = 5 basis 
functions, and when using m = 1600 we obtain Aznorm = 0.0026 
and a maximum normalized error A2:„onn = 0.0652. We note that 
although the training complexity costs require effort for large num¬ 
bers of basis functions, once all parameters are inferred we enjoy 
effectively a linear basis model performance running over unseen 
(test) data. We therefore consider the performance for a realistic, 
yet large, number of functions. For an in depth analysis of feature 
selection, magnitude cuts and t raining sample size on photometric 
redshift the reader is referred to lHovle et alj ( l2015h . 

We also generated photometric redshifts from a committee of 
five neural networks using a two-layer architecture, each layer with 
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Table 5. Performance measures of training the GP-VC model using m = 10 basis functions and different weighting schemes trained on the simulated survey. 



Az 

^^norm 

maxz 

^^^norm 


f-^norm 

O-z 

^norm 

OUt2 

^^^norm 

Normal 

0.0500 

0.0337 

0.6128 

0.6008 

-0.0017 

-0.0016 

0.0500 

0.0336 

0.0507 

0.0507 

Balanced 

0.0500 

0.0324 

0.4933 

0.3419 

0.0007 

0.0001 

0.0500 

0.0324 

0.0510 

0.0510 

Normalized 

0.0500 

0.0280 

0.6389 

0.2862 

0.0008 

-0.0005 

0.0500 

0.0280 

0.0458 

0.0498 



(a) Normal 



(b) Balanced 



(c) Normalized 

Figure 4. Box plots of residual errors on the hold-out test sample, showing 
median (bar), inter-quartile range (box) and range (whiskers) for (a) the di¬ 
rect sum of squared errors, (b) the balanced cost-sensitive learning and (c) 
the normalized cost learning for the GV-VC model trained on the simulated 
data using m = 10 basis functions. The right-most histograms are the em¬ 
pirical densities of errors. Figures (a) and (b) have similar scales, but note 
the scale difference in (c). 

twice the nu mber of hidden uni t s as t he number of filters as rec¬ 
ommended in ICollister & Lah^ ( l2004h and has become a standard 
for most ANNz users. The models were trained using the same 
training and validation samples, and the quoted results were calcu¬ 
lated on the test sample. The results of the final GP-VC and ANNz 
models are summarized in Table and the density scatter plots for 
the final models are shown in Figure[8]for comparison. We find that 
both learn the underlying (simulated) relationship, GP-VC however 
provides a factor of 7 improvement in the accuracy of Az and 
Aznorm over the commonly-used ANNz architecture for the simu¬ 
lated data set. 


7 RESULTS ON SDSS DATA 

In this section we compare ANNz with GP-VC on data from the 
SDSS 12th Data Release. The data used for training was selected 
from all the galaxies in the database where photometric and spec¬ 
troscopic data are available and any sources with missing data was 


|-»-ANNz « slableGP » GP-GL - » - GP-VL - » - GP-^ 



0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

Percentage of Training 

(a) 10 Basis Functions 


- • - ANNz • stableGP o GP-GL - • - GP-VL - • - GP-VC 



Percentage of Training 
(b) 100 Basis Functions 

Figure 5. Az as a function of training size for all the methods using a 
simple model (m = 10) and a complex model (m = 100) trained on the 
simulated data. 

excluded from training. The modelMag magnitudes were used 
with their associated error estimates. The following SQL statement 
was used to extract the data from the SDSS DR12 database using 
the CasJobs service provided by SDSfl 

SELECT 
p.objid, 

p.modelMag_u, p.modelMag_g, 
p.modelMag_r, p.modelMag_i, 
p.modelMag_z, p.modelMagerr_u, 
p. modelMagerr_g, p .inodelMagerr_r, 
p.modelMagerr_i, p.modelMagerr_z, 
s.z as zspec, s.zErr as zspecErr, 
s.survey as survey 


casjobs.sdss.org 









































A Sparse GP Framework for Photometric Redshift 11 


Table 6. Performance measures for the final ANNz model using a committee of 5 networks with 8:16:16:1 architectures and the final GP-VC model trained 
on the simulated survey using m = 1600 basis functions with a jointly optimized linear function on the simulated survey. 



Az 

^^norm 

maxz 

max Tiorm 


fJ'norm 

CTz 

^norm 

OUtz 

OUtnorm 

ANNz 

0.0262 

0.0180 

0.3696 

0.3391 

-0.0004 

-0.0007 

0.0262 

0.0180 

0.0433 

0.0406 

GP-VC 

0.0041 

0.0026 

0.0764 

0.0652 

0.0000 

0.0000 

0.0041 

0.0026 

0.0480 

0.0460 


lo” 


<1 

10 "' 


0 20 40 60 80 100 120 140 160 180 200 

Number of Basis 


Figure 6. Az as a function of the number of basis functions for all the 
methods. 
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Figure 7. A log-log plot reporting the Az and Aznorm scores after training 
GP-VC models on the simulated data with 5, 10, 25, 50, 100, 200, 400, 800 
and 1600 basis functions with joint linear optimization. The results for the 
Az were optimized using normal sum of squares whereas the results for the 
^^novm were optimized using normalized weights 



(a) ANNz 



(b) GP-VC 

Figure 8. The density scatter plot for (a) the final ANNz model using a 
committee of 5 networks with 8:16:16:1 architectures and (b) the final GP- 
VC model trained using m = 1600 basis functions with a jointly optimized 
linear mean function on the simulated survey. 


INTO 

mydb.modelmag_dataset 
FROM 

PhotoObjAll as p, SpecObj as s 
WHERE 

p.SpecObjID = s.SpecObjID AND 
S. class = 'GALAXY' AND 
s.zWarning = 0 AND 
p.mode = 1 AND 


dbo.fPhotoFlags('PEAKCENTER') != 0 AND 

dbo.fPhotoFlags('NOTCHECKED') != 0 AND 

dbo.fPhotoFlags('DEBLEND_NOPEAK') != 0 AND 

dbo.fPhotoFlags ('PSF_FLUX_INTERP' ) != 0 AND 

dbo.fPhotoFlags ('BAD_COUNTS_ERROR' ) != 0 AND 

dbo.fPhotoFlags('INTERP_CENTER') != 0 

W e use similar image flags to the ones used in iBrescia et alJ 
( l2014h . To target the original spectroscopic limit of the SDSS, the 
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models are also tested with a cutoff of r < 17.7 applied. Four 
different data sets were created from the retrieved data: 

(i) SDSS: This data set includes only sources from the SDSS but 
not the BOSS survey (817,604 sources) 

(ii) SDSS with cut: This data set includes only the sources from 
the SDSS data set with r < 17.7 (577,725 sources). 

(iii) SDSS+BOSS: This data set includes all sources from the 
BOSS and the SDSS surveys (2,120,465 sources). 

(iv) SDSS+BOSS with cut: This data set includes only the 
sources from the SDSS+BOSS data set with r < 17.7 (629,117 
sources). 

The distribution of the spectroscopic redshifts of the data sets 
are shown in Figure|3 Similar to the simulated data setup, we used 
80% for training, 10% for validation and 10% for testing in each 
data set. 


7.1 Varying the Degree of Freedom 

In this experiment, we compare ANNz to GP-VC’s performance 
based on the number of free parameters. This is achieved by set¬ 
ting the number of basis functions in GP-VC such that the number 
of free parameters are approximately equal to the number of free 
parameters in a two-layer neural network using the following as¬ 
signment: 


'^9 = rUg [(f + d) + d + 1 , 
Vz = ml -f dmz + 3mz + 1, 
-I- dmz + 3mz — d 


(25a) 

(25b) 

(25c) 


where "Dg is the degree of freedom in GP-VC, with joint prior mean 
function, Dz is the degree of freedom in ANNz, rrig is the number 
of basis functions in GP-VC and is the number of hidden units 
in ANNz. The number of hidden units is set to be equal in both 
layers. Setting the number of basis functions in GP-VC according 
to J25ct . ensures that T>g « T>z. 

In this test, we trained a two-layer ANNz architecture using 
10-100 hidden units and a matching GP-VC model based on Eq. 
( I25cl > with joint mean optimization. The number of hidden units 
was set to be equal in both layers but only a single network was 
used to generate the predictions not a committee of five networks. 
Both models were trained on the SDSS data set and the results on 
the test set are shown in Figure [TO] The results are consistent with 
the results from the simulated data, the performance of ANNz de¬ 
grades as we increase the complexity of the network, whereas GP- 
VC is more robust and shows a steady improvement. 


7.2 Final Results 

In this experiment, photometric redshifts were generated using a 
committee of five two-layer architectures, each layer with twice 
the number of hidden units as the number of filters and compared 
to the best performing GP-VC model from the previous test. The 
experiment was carried out on the four data sets, the performance 
measures are reported in tables 17^ I7bl iTcl and I7dl for the SDSS, 
SDSS with cut, SDSS+BOSS and SDSS+BOSS with cut respec¬ 
tively. The density scatter plots for the SDSS and SDSS+BOSS 
data sets are shown in Figure fTTI andl 12lresDectivelv. 

Although not as stark as for the simulated data set, our GP-VC 



Figure 10. A comparison between GP-VC and ANNz with two layers on 
the SDSS dataset using various degrees of freedom. The numbers on top 
of each point is the equivalent number of basis functions in GPVC and the 
number of hidden units for each layer in ANNz. 


algorithm consistently outperforms ANNz on the important met¬ 
rics (Az and Aznorm) and from examining the plots, the outliers 
are less extreme in the GP-VC case, with an increase in accuracy of 
around 5 per cent for the full SDSS+BOSS data set. However, we 
note that given the overall performance of GP-VC on the simulated 
data set, we expect the real power of our GP-VC to algorithm to be 
shown in the sparse data regime, and we reserve such an analysis 
to a future paper. 


8 SUMMARY AND FUTURE WORK 

In this paper a sparse Gaussian process framework is presented 
and applied to photometric redshift estimation. The framework is 
able to out perform ANNz, sparse GP parametrized by a set of 
global hyper-parameters and low rank approximation GP. The per¬ 
formance increase is attributed to the handling of distribution bias 
via a weighting scheme integrated as part of the optimization objec¬ 
tive, parametrizing each basis function with bespoke covariances, 
and integrating the learning of the prior mean function to enhance 
the extrapolation performance of the model. The methods were ap¬ 
plied to a simulated dataset and SDSS DR12 where the proposed 
approach consistently outperforms the other models on the impor¬ 
tant metrics (Az and Aznorm)- We find that the model scales lin¬ 
early in time with respect to the size of the data, and has a better 
generalization performance compared to the other methods even 
when presented with a limited training set. Results show that with 
only 30 per cent of the data, the model was able to reach accu¬ 
racy close to that of using the full training sample. Even when data 
were selectively removed based on RIZ magnitudes, the model 
was able to show the best recovery performance compared to the 
other models. The cost-sensitive learning component of the frame¬ 
work regularizes the predictions to limit the effect caused by the 
biased distribution of the output and allows for direct optimiza¬ 
tion of the survey objective (e.g. Znorm = \zs — 2p|/(l + Zs)). 
Again, the algorithm consistently outperforms other approaches, 
including ANNz and STABLEGP, in all reported experiments. We 
also investigate how the size of the training sample and the ba¬ 
sis set affects the accuracy of the photometric redshift prediction. 
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(a) SDSS (b) SDSS+BOSS 

Figure 9. The distribution of the spectroscopic redshift in the (a) SDSS and (b) SDSS+BOSS datasets. The top figures show the distributions with the r < 17.7 
cut while the bottom figures show the distributions without the cut. 


Table 7. Performance measures for the final ANNz model using a committee of 5 networks with 5:10:10:1 architectures and the final GP-VC model using 
m = 103 basis functions with a jointly optimized linear function on all the data sets. 



Az 

A2:noiTn 

maxz 

maXnorm 


l^norm 

(7z 

^norm 

outz 

OUtnorm 

ANNz 

GP-VC 

0.0308 

0.0302 

0.0249 

0.0244 

0.8689 

0.8678 

0.4478 

0.5017 

0.0000 

0.0000 

-0.0007 

-0.0006 

0.0308 

0.0302 

0.0249 

0.0244 

0.0327 

0.0316 

0.0378 

0.0365 






(a) SDSS 







Az 


maxz 

VkVQXYiorm t-^z 

l-^norm 

(Fz 

^norm 

OUt2 

^^^norm 

ANNz 

GP-VC 

0.0221 

0.0209 

0.0190 

0.0179 

0.8710 

0.8562 

0.4489 

0.4413 

0.0002 

0.0001 

-0.0002 

-0.0003 

0.0221 

0.0209 

0.0190 

0.0179 

0.0397 

0.0386 

0.0469 

0.0456 





(b) SDSS with cut 






Az 

^•2norm 

maxz 

maXTiorm 


fJ'norm 

(Fz 

(Fnorm 

OUtz 

G\if.norm 

ANNz 

GP-VC 

0.0539 

0.0513 

0.0386 

0.0366 

1.3113 

1.4284 

0.7457 

0.8601 

-0.0000 

-0.0001 

-0.0015 

-0.0013 

0.0539 

0.0513 

0.0386 

0.0366 

0.0393 

0.0385 

0.0346 

0.0340 





(c) SDSS-hBOSS 






Az 

^-^norm 

max2 

max Tiorm 

rz 

f^norm 

(Fz 

(Fnorm 

OUtz 

GM^-norm 

ANNz 

GP-VC 

0.0222 

0.0207 

0.0188 

0.0178 

0.8523 

0.9831 

0.4421 

0.5043 

-0.0000 

-0.0000 

-0.0004 

-0.0003 

0.0222 

0.0207 

0.0188 

0.0178 

0.0367 

0.0376 

0.0445 

0.0445 


(d) SDSS+BOSS with cut 
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(a) ANNZ 


(b) GP-VC 


Figure 11. The density scatter plot for (a) the final ANNz model using a committee of 5 networks with 5:10:10:1 architectures and (b) the final GP-VC 
model trained using m = 103 basis functions with a jointly optimized linear mean function. The top figures show the plots for the SDSS data set with out the 
r < 17.7 cut, while the bottom figures show the plots for the SDSS with cut. 


W e show that f o r the simulated set of galaxies, based on the work 
of Ijouvel et alj < l20Q9h . we are able to obtain a photometric red- 
shift accuracy of Aznorm = 0.0026 and rnaxnorm = 0.0652 us¬ 
ing 1600 basis functions which is a factor of seven improvement 
over the standard ANNz implementation. We find that GP-VC out¬ 
performed ANNz on the real data from SDSS-DR12, with an im¬ 
provement in accuracy of ~ 5 per cent, even when restricted to 
have the same number of free parameters. In future work we will 
test the algorithm on a range of real data, and pursue investigations 
of how the algorithm performs over different redshift regimes and 
for different galaxy types. 
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(a) ANNZ 
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trained using m = 103 basis functions with a jointly optimized linear mean function. The top figures show the plots for the SDSS-I-BOSS data set with out 
the r < 17.7 cut, while the bottom figures show the plots for the SDSS-hBOSS with cut. 
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